Development of a deep learning model that predicts Bi-level positive airway pressure failure

Delaying intubation for patients failing Bi-Level Positive Airway Pressure (BIPAP) may be associated with harm. The objective of this study was to develop a deep learning model capable of aiding clinical decision making by predicting BIPAP failure. This was a retrospective cohort study in a tertiary pediatric intensive care unit (PICU) between 2010 and 2020. Three machine learning models were developed to predict BIPAP failure: two logistic regression models and one deep learning model, a recurrent neural network with a Long Short-Term Memory (LSTM-RNN) architecture. Model performance was evaluated in a holdout test set. Of 630 total BIPAP sessions, 175 (27.7%) were BIPAP failures. Patients in the BIPAP failure group were on BIPAP for a median of 32.8 (9.2–91.3) hours prior to intubation. Patients with late BIPAP failure (intubation after more than 24 h on BIPAP) had fewer 28-day Ventilator Free Days (13.40 [0.68–20.96]), longer ICU length of stay, and more post-extubation BIPAP days than those who were intubated ≤ 24 h from BIPAP initiation. An area under the receiver operating characteristic curve (AUROC) above 0.5 indicates that a model has extracted new information, potentially valuable to the clinical team, about BIPAP failure. Within 6 h of BIPAP initiation, the LSTM-RNN model predicted which patients were likely to fail BIPAP with an AUROC of 0.81 (0.80, 0.82), superior to all other models. At the same time point, the LSTM-RNN model would identify nearly 80% of BIPAP failures with a 50% false alarm rate, equal to a number needed to alert (NNA) of 2. In conclusion, a deep learning method using readily available data from the electronic health record can identify which patients on BIPAP are likely to fail with good discrimination, oftentimes days before they are intubated in usual practice.

Bi-level Positive Airway Pressure (BIPAP) is a form of non-invasive ventilation (NIV) increasingly used in adults and children with acute respiratory failure [1][2][3] . BIPAP can assist in lung recruitment, offload respiratory muscle work, and improve gas exchange 4 . BIPAP is an alternative to endotracheal intubation in many circumstances, although BIPAP failure frequently occurs, particularly in patients with hypoxemic respiratory failure and lung injury 5 .
Currently, the decision to intubate a child on BIPAP is primarily driven by clinical judgment. In children, the severity of oxygenation abnormalities (such as the SpO2/FiO2 ratio) is also associated with intubation risk, although these metrics are neither sensitive nor specific 22,23 . For adults, the HACOR scale was developed to predict NIV failure in hypoxemic patients by comparing measurements of five variables (heart rate, pH, Glasgow Coma Score (GCS), PaO2/FiO2 ratio, and respiratory rate) to reference values 24 . Electronic Medical Records (EMR) and advanced machine learning (ML) methods provide an opportunity for timely and accurate identification of children likely to fail BIPAP. ML models can incorporate hundreds of variables, and deep learning neural networks, a subset of ML algorithms, can be trained to predict temporally evolving targets, including a patient's medical status [25][26][27][28][29][30] . The objective of this study was to develop a deep learning neural network model capable of continuously predicting BIPAP failure following BIPAP initiation in critically ill children and to compare it to multivariate logistic regression models using data available in the EMR. Additional objectives were to determine the characteristics of patients who fail BIPAP and to investigate the impact of the timing of BIPAP failure on outcomes including Ventilator-free Days (VFDs), ICU length of stay (LOS), and mortality.
Scientific Reports | (2022) 12:8907 | https://doi.org/10.1038/s41598-022-12984-x www.nature.com/scientificreports/

Material and methods
Data sources and variables. The number of Ventilator-free Days (VFD) at 28 days was calculated in two ways. 28-VFD Invasive Mechanical Ventilation (IMV) was defined as the number of days within a 28-day time frame after IMV initiation that the patient was alive and not on IMV. Successful extubation from IMV was defined as not requiring re-intubation within 48 h of extubation 32,33 . 28-VFD IMV BIPAP was defined as the number of days within a 28-day time frame after BIPAP initiation that the patient was alive and neither on IMV nor BIPAP. Successful weaning from BIPAP was defined as not requiring BIPAP or IMV within 48 h of BIPAP termination. All patients with greater than or equal to 28 ventilator days and those who died within 28 days of mechanical ventilation were assigned 0 VFD (examples of VFD calculation can be found in Supplementary Fig. 1).
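The VFD definition above can be illustrated with a minimal Python sketch. It assumes that the total number of days on support within the 28-day window has already been tallied (with any re-initiation within 48 h counted as continued support). The function name `ventilator_free_days` and its arguments are hypothetical and are not taken from the study code.

```python
def ventilator_free_days(support_days: float,
                         died_within_window: bool,
                         window_days: float = 28.0) -> float:
    """Ventilator-free days within a fixed window (sketch of the text's definition).

    support_days: days on IMV (28-VFD IMV) or on IMV or BIPAP (28-VFD IMV BIPAP)
                  within the window; assumed to be precomputed.
    died_within_window: True if the patient died within the window.
    """
    # Patients who died, or who needed support for the whole window, get 0 VFD.
    if died_within_window or support_days >= window_days:
        return 0.0
    return window_days - support_days


# Hypothetical patients, for illustration only.
print(ventilator_free_days(5.5, died_within_window=False))   # 22.5
print(ventilator_free_days(30.0, died_within_window=False))  # 0.0
print(ventilator_free_days(3.0, died_within_window=True))    # 0.0
```

The same function covers both VFD variants; only the definition of `support_days` changes.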

Statistical analyses to characterize BIPAP failures.
The non-parametric Mann-Whitney U test was used to compare demographic variables and characteristics (vital signs, laboratory results, and pediatric chronic complex conditions diagnoses based on ICD-10) between BIPAP failures and non-failures. Patients who met the early, intermediate, or late BIPAP failure definitions were characterized and compared in terms of their 28-VFD and 28-VFD IMV BIPAP. Post-extubation BIPAP and ICU LOS were also evaluated as outcome measures. The non-parametric Kruskal-Wallis H test was used for the comparisons among the three BIPAP failure groups. All statistical analyses were completed using SciPy 1.4.1 in Python 3.7.4.
Data preprocessing. EMR measurements are asynchronously and irregularly charted. Pre-processing techniques described in previous work converted these measurements and other patient data into a matrix format amenable to machine learning. At any time when at least one variable had a recorded value, the missing values for other variables were imputed, following the process of prior work 26,27 . Missing drug or intervention measurements were imputed with zero to indicate absence of treatment. When a physiologic observation or lab measurement was available, it was propagated forward until another measurement was recorded. When prior measurements were not available, the variable was imputed using the training set population mean. The S/F ratio, defined as SpO2 over FiO2, was used when SpO2 was between 80 and 97% as a measure of a patient's oxygenation 34 . FiO2 was treated the same regardless of BIPAP mask interface (nasal, oro-nasal, and total face mask). At any time when SpO2 fell outside of the interval (80%, 97%), the S/F ratio was forward filled from the last validly computed S/F ratio. Additional details can be found in Supplementary Fig. 2.
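The imputation rules described above can be sketched in a few lines of Python. This is an illustration only, not the study's pipeline: the helper names `forward_fill` and `sf_ratio` are hypothetical, FiO2 is assumed to be expressed as a fraction (0.21 to 1.0), the SpO2 bounds are assumed to be inclusive, and the fallback value of 300.0 stands in for a training-set mean that is not reported here.

```python
from typing import List, Optional, Sequence


def forward_fill(values: Sequence[Optional[float]], fallback: float) -> List[float]:
    """Carry the last recorded value forward; use `fallback` before any value exists."""
    filled: List[float] = []
    last: Optional[float] = None
    for v in values:
        if v is not None:
            last = v
        filled.append(fallback if last is None else last)
    return filled


def sf_ratio(spo2: Sequence[Optional[float]],
             fio2: Sequence[Optional[float]],
             lo: float = 80.0, hi: float = 97.0) -> List[Optional[float]]:
    """SpO2/FiO2 where SpO2 lies within [lo, hi] (assumed inclusive); None elsewhere."""
    out: List[Optional[float]] = []
    for s, f in zip(spo2, fio2):
        valid = s is not None and f is not None and f > 0 and lo <= s <= hi
        out.append(s / f if valid else None)
    return out


# Hypothetical hourly charting: nothing recorded at hour 0, SpO2 of 99% at hour 2.
spo2 = [None, 96.0, 99.0, 90.0]
fio2 = [None, 0.25, 0.25, 0.5]

# 300.0 is a made-up stand-in for the training-set population mean.
sf = forward_fill(sf_ratio(spo2, fio2), fallback=300.0)
print(sf)  # [300.0, 384.0, 384.0, 180.0]

# Missing drug or intervention values are set to zero (absence of treatment).
dose = [None, 0.1, None]
dose_filled = [0.0 if d is None else d for d in dose]
print(dose_filled)  # [0.0, 0.1, 0.0]
```

At hour 2 the SpO2 of 99% is outside the valid range, so the S/F ratio from hour 1 is carried forward rather than recomputed.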
Model development for predicting BIPAP failure. Prior to model development, the cohort of BIPAP sessions was partitioned into a training (80%) and a test (holdout) set (20%) for assessing model performance.
Partitioning was done such that all BIPAP sessions of a single patient belonged to one of the sets. No other stratifications were applied. Three machine learning models were developed to predict BIPAP failure each time a new set of observations became available: two logistic regression models and a deep learning model, a recurrent neural network with a Long Short-Term Memory (LSTM-RNN) architecture [25][26][27] . The SpO2/FiO2 (S/F) ratio was evaluated as a reference model because it has been described as a useful outcome predictor in children as early as  23 . The inputs to the first logistic regression model (LR HACOR) included four of the five HACOR scale input variables: heart rate, pH, GCS, and respiratory rate 24 . The pH values for LR HACOR came from the arterial, capillary, and venous blood gas (ABG, CBG, VBG) measurements. The fifth HACOR scale variable, the PaO2/FiO2 ratio, was replaced in LR HACOR by the S/F ratio, a reliable noninvasive surrogate of the PaO2/FiO2 ratio, because SpO2 was more frequently available than PaO2 23,34 . Finally, SpO2 was also an LR HACOR input. The second logistic regression model (LR EMR) used 301 variables representing vital signs, laboratory results, medications (antibiotics, vasopressors, inotropes, diuretics, sedatives, etc.), respiratory support (supplemental oxygen, BIPAP and ventilator settings), invasive procedures, radiography, and nursing assessments (see Supplementary Table 1 and Supplementary Table 2). It has been previously demonstrated that extraneous features from the EMR do not degrade LSTM-RNN performance 35 . The LSTM-RNN model has feedback/forward connections that allow it to process time series data in a sequential manner and integrate information from previous times with newly available inputs to inform current predictions. The LSTM-RNN model used the same 301 input variables as the LR EMR (Supplementary Fig. 3 illustrates the flow of inputs into and outputs from the models; Supplementary Table 3 shows the hyperparameters of the LSTM-RNN). The two LR models were developed as comparators for the LSTM-RNN model, with LR EMR serving as a bridge between LR HACOR and the LSTM-RNN model.
Model evaluation. Predicting BIPAP failure. Models were evaluated and compared for performance using the area under the receiver operating characteristic curve (AUROC). AUROC was computed for predictions at various hours (1, 2, …, 24) after BIPAP initiation (n-hour AUROCs). Performance was reported by bootstrap sampling the evaluation set 100 times, with each bootstrap iteration randomly selecting 75% of the BIPAP sessions in the set without replacement and calculating the AUROC of model predictions in that draw. The mean and 95% confidence interval of the AUROC scores were then calculated to report average performance and estimate population variance.
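The resampling procedure described above can be sketched as follows, using only the Python standard library. The sketch makes several assumptions that the text does not confirm: the function names `auroc` and `subsampled_auroc` are hypothetical, the 95% interval is taken as the 2.5th and 97.5th percentiles of the resampled AUROCs, and draws containing only one class are skipped.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random failure outranks a random non-failure."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def subsampled_auroc(labels: Sequence[int], scores: Sequence[float],
                     n_iter: int = 100, frac: float = 0.75,
                     seed: int = 0) -> Tuple[float, Tuple[float, float]]:
    """Mean AUROC and percentile interval over draws of `frac` of sessions, no replacement."""
    rng = random.Random(seed)
    n = len(labels)
    k = round(frac * n)
    values: List[float] = []
    for _ in range(n_iter):
        idx = rng.sample(range(n), k)  # sampling without replacement
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # assumption: skip draws that contain a single class
        values.append(auroc(y, [scores[i] for i in idx]))
    cuts = statistics.quantiles(values, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return statistics.fmean(values), (cuts[0], cuts[-1])


# Toy example: one score per BIPAP session at a given assessment hour.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores))  # 0.75
```

In practice, `labels` would hold the failure indicator for each session in the holdout set and `scores` the model output at hour n, with the call repeated for each assessment hour.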
In the rolling cohort n-hour AUROCs, failures or successes that already occurred before the next hour of evaluation were excluded from the computation; thus, the number of cases used in the AUROC computation decreased over time. Number needed to alert (NNA) 36 was also used to evaluate the predictions at 6 and 24 h after BIPAP initiation. NNA is defined as the sum of true positives and false positives divided by the number of true positives for any specific threshold. Note that NNA is the inverse of positive predictive value.
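The NNA definition above amounts to a short calculation at a chosen alert threshold. The sketch below is illustrative: the function name `nna_and_sensitivity`, the rule that an alert fires when the score is at or above the threshold, and the toy scores are all assumptions.

```python
from typing import Sequence, Tuple


def nna_and_sensitivity(labels: Sequence[int], scores: Sequence[float],
                        threshold: float) -> Tuple[float, float]:
    """Number needed to alert, (TP + FP) / TP, and sensitivity at one threshold."""
    alerts = [s >= threshold for s in scores]  # assumed alert rule
    tp = sum(1 for y, a in zip(labels, alerts) if a and y == 1)
    fp = sum(1 for y, a in zip(labels, alerts) if a and y == 0)
    fn = sum(1 for y, a in zip(labels, alerts) if not a and y == 1)
    nna = (tp + fp) / tp if tp else float("inf")  # NNA = 1 / positive predictive value
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return nna, sensitivity


# Toy example: three failures (label 1) and three non-failures (label 0).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.6, 0.1]

print(nna_and_sensitivity(labels, scores, threshold=0.5))   # NNA 2.0: half of the alerts are false
print(nna_and_sensitivity(labels, scores, threshold=0.75))  # NNA 1.0: no false alarms
```

Sweeping the threshold and plotting NNA against sensitivity gives the kind of curve used to compare models at a fixed false alarm burden.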
Subgroup analysis: hypoxemic patients. Model performance in patients with a low S/F ratio at the time of 6-h and 24-h predictions were evaluated to determine whether the model better discriminates BIPAP failure in hypoxemic patients than in non-hypoxemic patients. S/F ratio was considered low when it was less than 264, which is a diagnostic criterion for Pediatric Acute Respiratory Distress Syndrome (PARDS) for patients on NIV 37 .

Results
Demographics and characteristics of BIPAP cohort. The inclusion and exclusion criteria resulted in 630 BIPAP sessions, 175 (27.7%) of which met the definition for BIPAP failure: requiring escalation to invasive mechanical ventilation within 48 h of BIPAP termination (Fig. 1). Table 1 shows the characteristics of the entire cohort, partitioned into failures and non-failures. The average ages in the two groups (10.7 years in the BIPAP success group and 9.9 years in the BIPAP failure group) were not significantly different. The BIPAP failure group was on BIPAP for a median of 32.8 (9.2-91.3) hours prior to intubation. In the BIPAP failure group, the median time interval between BIPAP termination and mechanical ventilation was 1.7 h (IQR 1.0-2.5 h). The BIPAP failure group had greater PRISM-III scores than the non-failure group. Among the BIPAP sessions lasting at least six hours, the respiratory rate at the 6th hour after BIPAP initiation was higher in the BIPAP failure group than in the BIPAP success group (median of 31 vs 25 breaths per minute). The average CBG pH in the BIPAP failure group was significantly lower than in the success group. The average SpO2/FiO2 ratio in the BIPAP failure group, 194, was significantly lower than in the BIPAP success group, 243. The median BIPAP Inspiratory Positive Airway Pressure (IPAP) was 16 Table 4), the BIPAP failure group had higher incidence rates of oncologic (15.4% vs 9.5%), rheumatologic (4.0% vs 1.3%), and metabolic (4.6% vs 1.5%) conditions.
Model performance: continuous predictions over time on BIPAP. At all assessment times, the LSTM-RNN model discriminated better between BIPAP failure and non-failure than the other three models (Fig. 2, left). Counts of failures and non-failures at each assessment hour are in Supplementary  Table 6). Only the LSTM-RNN model performed better in hypoxemic patients than in the general cohort.
The LSTM-RNN model demonstrated a consistently lower NNA than the other models across sensitivities (Fig. 2, right). When operating at NNA = 2, the LSTM-RNN model identified nearly 80% of BIPAP failures within 6 h. For NNA = 1 (i.e., no false alarms) within 6 h of BIPAP initiation, the LSTM-RNN identified almost 25% of BIPAP failures (7 BIPAP episodes). All other models at this sensitivity had at least one false alarm. Figure 3 shows two BIPAP episodes where the LSTM-RNN predicted BIPAP failure after 6 h using the threshold of NNA = 1; they were intubated 11 and 56 h after the alarm. The other five BIPAP episodes correctly predicted to fail (with NNA = 1) were intubated 5, 21, 44, 51, and 94 h after the alarm.

Discussion
Identifying children likely to fail BIPAP is important for several reasons. First, there is increasing concern that patients on NIV with high respiratory effort may exacerbate injury to their own lungs, a phenomenon termed patient self-inflicted lung injury (P-SILI) 12 . The mechanisms of P-SILI have been well described and include lung stress with high transpulmonary pressure, lung strain with high tidal volume relative to end-expiratory lung volume, atelectrauma, and pendelluft [12][13][14][15][16][17][18] . If patients are breathing with these injurious patterns on BIPAP, then longer exposure to BIPAP without reducing respiratory work through intubation, sedation, and/or neuromuscular blockade may lead to lung injury progression. Importantly, our findings corroborate recent data that pediatric ARDS patients receiving pre-intubation BIPAP for greater than or equal to 24 h have longer lengths of PICU and hospital stay and higher 28- and 90-day mortality compared to ARDS patients who were either intubated primarily or on BIPAP for less than 24 h 6 . In addition, children who are intubated after failing BIPAP have higher rates of complications during intubation such as desaturation, prolonged hypoxemia, and even cardiac arrest [6][7][8][9][10]21 . This evidence supports the hypothesis that timely identification and intervention in children likely to fail BIPAP may prevent complications.
Figure 1. Cohort selection process. The dataset contained 948 instances where BIPAP was initiated. All BIPAP sessions of a patient with a previous diagnosis of respirator/ventilator dependence or sleep apnea were excluded. The last BIPAP session of any patient with a DNI/DNR order or who was transferred out of the unit prior to the BIPAP being discontinued was excluded. BIPAP sessions that resulted in intubation within 48 h of BIPAP discontinuation were considered BIPAP failures.
The HACOR scale was developed to predict NIV failure in hypoxemic adults, but to the authors' knowledge there are no decision support tools for BIPAP failure in children. This study demonstrated that a deep learning model (LSTM-RNN) using readily available EMR data could identify children at risk for BIPAP failure. In the general cohort, the LSTM-RNN model achieved higher AUROCs than the two logistic regression models (one using the same inputs as the LSTM-RNN model and another using five physiologic variables predictive of NIV failure) and the S/F ratio at every assessment hour. In hypoxemic patients (S/F ratio < 264), a group of particular interest 23,24,34 because their risks of P-SILI are higher, the LSTM-RNN model also had higher AUROC than all other models. Interestingly, the LSTM-RNN model had higher AUROC in the hypoxemic group than in the general cohort, which was not the case for the other models.
It is important to note that the predictions are predicated on the clinical intervention deemed appropriate by the care team. The performance measures the ability of the models to predict BIPAP failure given that the clinical team determined it was appropriate to keep the child on BIPAP. A model with a 0.5 AUROC is equivalent
Model performance at hour 6 is noteworthy because a large proportion of children who ultimately fail BIPAP and get intubated do so more than 6 h after BIPAP initiation. Because interventions such as intubation have risks, a diagnostic tool in this domain must have a modest or low false alarm rate. The NNA vs detection plot in Fig. 2 demonstrates the advantage of the LSTM-RNN model over the other models. Nearly 80% of the BIPAP failures can be identified within 6 h, with an NNA of 2. At an NNA of 1 (i.e., no false alarms), the LSTM-RNN identified almost 25% of the BIPAP failures with a median time to failure of 44 h from the alarm. The LSTM-RNN can prompt clinicians to take a closer look at these high-risk patients to determine the best course of action, which may include intubation or adjusting BIPAP settings.
Many clinicians regularly follow variables such as the S/F ratio to gauge clinical response to BIPAP. In these experiments the S/F ratio had some prognostic relevance but produced more false alarms than the LSTM-RNN model. The LR models learned meaningful relationships among all available EMR variables and the target outcome to make better predictions. The LSTM-RNN model's ability to include time dependencies and multiple, unselected variables resulted in superior performance in finding and understanding relationships between the patient's dynamic clinical state and the target outcome of BIPAP failure. Furthermore, the results demonstrated that the LSTM-RNN model can learn over time as the patient's condition changes to continuously improve its prediction of BIPAP failure. Model performance at 6 and 24 h was highlighted, but in fact the model can generate a prediction as soon as enough data are available (i.e., even within the first hour). This may be important in future applications to identify patients failing within the first hour of BIPAP. The results of this proof-of-concept study demonstrate the feasibility of analyzing critical care data with advanced ML methods to provide clinical decision support. A tool could be integrated into the clinical workflow, either as part of bedside monitoring or as a web tool easily accessed by clinicians, to obtain dynamic predictions on important patient outcomes. The actual design and implementation of such tools require careful understanding of many different areas well outside the scope of this study 38,39 .
This study had a few important limitations due to its single-center, retrospective nature. BIPAP and intubation practices, including the selection of airway pressure and supplemental oxygen settings, may be different at our institution compared to others, which may affect generalizability. Additionally, to generate the sample size needed for the analysis, we used data spanning 10 years, and there may have been important changes to practice with respect to non-invasive ventilation for respiratory failure over that period. We did not have information on the type of BIPAP interface used (oral versus nasal). Another limitation was that the RNN was trained on a target outcome, BIPAP failure, that reflects a complex and subjective decision made by bedside clinicians. Finally, we excluded patients who may be on home BIPAP (i.e., patients with a diagnosis of obstructive sleep apnea).

Conclusions
A machine learning model using electronic health record data from children on BIPAP can identify children with high likelihood of failing BIPAP, with a relatively modest false alarm rate. We suggest external validation of this model in additional ICUs to test its generalizability, followed by a clinical trial to determine if it results in improved outcomes for children with acute respiratory failure on BIPAP.

Data availability
The data come from electronic medical records at Children's Hospital Los Angeles and are not publicly available because they contain sensitive patient information.